---
title: Readers
description: Learn how to use readers to convert raw data into searchable knowledge for your Agents.
keywords: [readers, knowledge]
---

Readers are the first step in the process of creating Knowledge from content.
They transform raw content from various sources into structured `Document` objects that can be chunked, embedded, and stored in a vector database.

## What are Readers?

A **Reader** is a specialized component that knows how to parse and extract content from specific data sources or file formats. Think of readers as translators that convert different content formats into a standardized format that Agno can work with.

Every piece of content that enters your knowledge base must pass through a reader first. The reader's job is to:

1. **Parse** the raw content from its original format
2. **Extract** the meaningful text and metadata
3. **Structure** the content into `Document` objects
4. **Apply chunking** strategies to break large content into manageable pieces

## How Readers Work

All readers inherit from the base `Reader` class and follow a consistent pattern:

```python
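from typing import List, Optional

from agno.knowledge.document.base import Document


# Simplified view of the core methods every reader implements
class Reader:
    def read(self, obj, name: Optional[str] = None) -> List[Document]:
        """Synchronously read and process content"""
        raise NotImplementedError

    async def async_read(self, obj, name: Optional[str] = None) -> List[Document]:
        """Asynchronously read and process content"""
        raise NotImplementedError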
```

### The Reading Process

When a reader processes content, it follows these steps:

1. **Content Ingestion**: The reader receives raw content (file, URL, text, etc.)
2. **Parsing**: Extract text and metadata using format-specific logic
3. **Document Creation**: Convert parsed content into `Document` objects
4. **Chunking**: Apply chunking strategies to break content into smaller pieces
5. **Return**: Provide a list of processed documents ready for embedding
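
The steps above can be sketched with a toy reader. This is a minimal illustration, not Agno's implementation: `SketchDocument` and `ToyTextReader` are hypothetical stand-ins that only mirror the general shape of `Document` and `Reader`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for Agno's Document (not the real class)."""

    content: str
    name: Optional[str] = None
    meta_data: Dict[str, Any] = field(default_factory=dict)


class ToyTextReader:
    """Hypothetical reader that walks through the five steps for raw text."""

    def __init__(self, chunk: bool = True, chunk_size: int = 20) -> None:
        self.chunk = chunk
        self.chunk_size = chunk_size

    def read(self, raw: bytes, name: Optional[str] = None) -> List[SketchDocument]:
        # 1. Ingestion: `raw` is the content handed to the reader.
        # 2. Parsing: decode the bytes and normalize whitespace.
        text = " ".join(raw.decode("utf-8").split())
        # 3. Document creation: wrap the parsed text.
        doc = SketchDocument(content=text, name=name or "untitled")
        if not self.chunk:
            return [doc]
        # 4. Chunking: split into fixed-size character windows.
        pieces = [
            text[i : i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]
        # 5. Return: one document per chunk, tagged with its position.
        return [
            SketchDocument(content=piece, name=doc.name, meta_data={"chunk": n})
            for n, piece in enumerate(pieces, start=1)
        ]


docs = ToyTextReader(chunk_size=20).read(
    b"Readers turn raw   content\ninto documents.", name="demo"
)
```

Real readers replace step 2 with format-specific parsing and delegate step 4 to a chunking strategy, but the overall flow is the same.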

### Content Types and Specialization

Each reader specializes in handling specific content types:

```python
@classmethod
def get_supported_content_types(cls) -> List[ContentType]:
    """Returns the content types this reader can handle"""
    return [ContentType.PDF]  # Example for PDFReader
```

This specialization allows each reader to:
- Use format-specific parsing libraries
- Extract relevant metadata
- Handle format-specific challenges (encryption, encoding, etc.)
- Optimize processing for that content type

## Reader Configuration

Readers are highly configurable to meet different processing needs:

### Chunking Control

```python
reader = PDFReader(
    chunk=True,                    # Enable/disable chunking
    chunk_size=1000,              # Size of each chunk
    chunking_strategy=MyStrategy() # Custom chunking logic
)
```
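
`MyStrategy` above is a placeholder. As a rough sketch of what custom chunking logic could look like, the class below packs whole paragraphs into chunks up to a size limit. The `chunk_text` method name and the plain-string interface are assumptions made for illustration; a real strategy would need to implement whatever interface Agno's chunking base class defines and return `Document` objects.

```python
from typing import List


class ParagraphPackingStrategy:
    """Hypothetical strategy: pack whole paragraphs into chunks of up to chunk_size characters."""

    def __init__(self, chunk_size: int = 1000) -> None:
        self.chunk_size = chunk_size

    def chunk_text(self, text: str) -> List[str]:
        chunks: List[str] = []
        current = ""
        for paragraph in (p.strip() for p in text.split("\n\n")):
            if not paragraph:
                continue
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) <= self.chunk_size or not current:
                # Fits, or it is a single oversized paragraph that is kept whole.
                current = candidate
            else:
                # Adding this paragraph would exceed the limit: close the chunk.
                chunks.append(current)
                current = paragraph
        if current:
            chunks.append(current)
        return chunks


text = "First paragraph.\n\nSecond paragraph.\n\nThird one is here."
chunks = ParagraphPackingStrategy(chunk_size=40).chunk_text(text)
```

With a 40-character limit, the first two paragraphs share a chunk and the third starts a new one, so no paragraph is split mid-sentence.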

### Content Processing Options

```python
reader = PDFReader(
    split_on_pages=True,          # Create separate documents per page
    password="secret123",         # Handle encrypted PDFs
    read_images=True             # Extract text from images via OCR
)
```

### Encoding Control

For text-based readers, you can override the file encoding:

```python
reader = TextReader(
    encoding="utf-8"              # Override default encoding
)

reader = CSVReader(
    encoding="latin-1"            # Handle files with specific encodings
)

reader = MarkdownReader(
    encoding="cp1252"             # Windows-specific encoding
)
```
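
To see why the override matters, the sketch below decodes the same bytes with several candidate encodings using only the standard library. `decode_with_fallback` is a hypothetical helper, not part of Agno; it shows one way to work out which `encoding` value to pass to a reader.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1")
) -> Tuple[str, str]:
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


raw = "café".encode("cp1252")  # b'caf\xe9' is not valid UTF-8
text, used = decode_with_fallback(raw)
```

Note that a wrong single-byte encoding can decode without raising and silently produce garbled characters, so it is worth spot-checking the output when the source encoding is unknown.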

### Metadata and Naming

```python
documents = reader.read(
    file_path,
    name="custom_document_name",  # Override default naming
    password="file_password"      # Runtime password override
)
```

## The Document Output

Readers convert raw content into `Document` objects with this structure:

```python
Document(
    content="The extracted text content...",
    id="unique_document_identifier",
    name="document_name",
    meta_data={
        "page": 1,                # Page number for PDFs
        "url": "https://...",     # Source URL for web content
        "author": "...",          # Document metadata
    },
    size=len(content)             # Content size in characters
)
```
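
Downstream code can use `meta_data` to trace a chunk back to its source. A minimal sketch, using plain dictionaries with made-up values as stand-ins for `Document` objects:

```python
from collections import defaultdict
from typing import Any, Dict, List

# Plain dicts standing in for Document objects (hypothetical data).
documents: List[Dict[str, Any]] = [
    {"name": "report", "content": "Intro text", "meta_data": {"page": 1}},
    {"name": "report", "content": "More intro", "meta_data": {"page": 1}},
    {"name": "report", "content": "Results", "meta_data": {"page": 2}},
]

# Group chunk content by the page it came from.
by_page: Dict[int, List[str]] = defaultdict(list)
for doc in documents:
    by_page[doc["meta_data"]["page"]].append(doc["content"])
```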

## Chunking Integration

One of the most important features of readers is their integration with chunking strategies:

### Automatic Chunking

When `chunk=True`, readers automatically apply chunking strategies to break large documents into smaller, more manageable pieces:

```python
# Large PDF gets broken into multiple documents
pdf_reader = PDFReader(chunk=True, chunk_size=1000)
documents = pdf_reader.read("large_document.pdf")
# Returns: [Document(chunk1), Document(chunk2), Document(chunk3), ...]
```
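
As a rough mental model of what `chunk_size` controls, the sketch below splits text into fixed-size character windows with an optional overlap. `fixed_size_chunks` and its `overlap` parameter are assumptions for illustration; Agno's chunking strategies may split on different boundaries.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once a window has reached the end, so no chunk is pure overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i : i + chunk_size] for i in range(0, last_start, step)]


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
```

Each chunk here repeats the last character of the previous one, which is the usual reason to add overlap: context at a chunk boundary is not lost entirely.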

### Chunking Strategy Support

Different readers support different chunking strategies based on their content type:

```python
@classmethod
def get_supported_chunking_strategies(cls) -> List[ChunkingStrategyType]:
    return [
        ChunkingStrategyType.DOCUMENT_CHUNKING,  # Respect document structure
        ChunkingStrategyType.FIXED_SIZE_CHUNKING, # Fixed character/token limits
        ChunkingStrategyType.SEMANTIC_CHUNKING,   # Semantic boundaries
        ChunkingStrategyType.AGENTIC_CHUNKING,    # AI-powered chunking
    ]
```

## Reader Factory and Auto-Selection

Agno provides intelligent reader selection through the `ReaderFactory`:

```python
# Automatic reader selection based on file extension
reader = ReaderFactory.get_reader_for_extension(".pdf")  # Returns PDFReader
reader = ReaderFactory.get_reader_for_extension(".csv")  # Returns CSVReader

# URL-based reader selection
reader = ReaderFactory.get_reader_for_url("https://youtube.com/watch?v=...")  # YouTubeReader
reader = ReaderFactory.get_reader_for_url("https://example.com/doc.pdf")     # PDFReader
```
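
The selection logic can be pictured as a lookup on the URL's host and file extension. The sketch below is an assumption about the general idea, not `ReaderFactory`'s actual code: the registries map to reader names as plain strings, and the `WebsiteReader` default is a guess made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical registries: extension or host -> reader name (strings, not real classes).
EXTENSION_READERS = {".pdf": "PDFReader", ".csv": "CSVReader", ".md": "MarkdownReader"}
HOST_READERS = {"youtube.com": "YouTubeReader", "www.youtube.com": "YouTubeReader"}


def pick_reader_name(url: str, default: str = "WebsiteReader") -> str:
    """Pick a reader name by host first, then by file extension, else fall back."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in HOST_READERS:
        return HOST_READERS[host]
    suffix = PurePosixPath(parsed.path).suffix.lower()
    return EXTENSION_READERS.get(suffix, default)


youtube = pick_reader_name("https://youtube.com/watch?v=abc")
pdf = pick_reader_name("https://example.com/doc.pdf")
other = pick_reader_name("https://example.com/about")
```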

## Supported Readers

The following readers are currently supported:

| Reader Name               | Description                                                          |
|---------------------------|----------------------------------------------------------------------|
| ArxivReader               | Fetches and processes academic papers from arXiv                     |
| CSVReader                 | Parses CSV files and converts rows to documents                      |
| FieldLabeledCSVReader     | Converts CSV rows to field-labeled text documents                    |
| FirecrawlReader           | Uses Firecrawl API to scrape and crawl web content                   |
| JSONReader                | Processes JSON files and converts them into documents                |
| MarkdownReader            | Reads and parses Markdown files                                      |
| PDFReader                 | Reads and extracts text from PDF files                               |
| PPTXReader                | Reads and extracts text from PowerPoint (.pptx) files                |
| TextReader                | Handles plain text files                                             |
| WebsiteReader             | Crawls entire websites following links recursively                   |
| WebSearchReader           | Searches and reads web search results                                |
| WikipediaReader           | Searches and reads Wikipedia articles                                |
| YouTubeReader             | Extracts transcripts and metadata from YouTube videos                |


## Async Processing

All readers support asynchronous processing for better performance:

```python
import asyncio

# Synchronous reading
documents = reader.read("file.pdf")

# Asynchronous reading (inside an async function): better for I/O-intensive operations
documents = await reader.async_read("file.pdf")

# Batch processing: read several files concurrently
tasks = [reader.async_read(file) for file in file_list]
all_documents = await asyncio.gather(*tasks)  # One list of documents per file
```
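
The batch pattern above can be tried end to end without any files. In this sketch, `FakeAsyncReader` is a hypothetical stand-in that simulates I/O with a short sleep, and the semaphore is an optional addition that caps how many reads run at once.

```python
import asyncio
from typing import List


class FakeAsyncReader:
    """Hypothetical stand-in whose async_read simulates I/O with a short sleep."""

    async def async_read(self, path: str) -> List[str]:
        await asyncio.sleep(0.01)  # Simulated I/O wait
        return [f"contents of {path}"]


async def read_all(paths: List[str], max_concurrency: int = 2) -> List[str]:
    reader = FakeAsyncReader()
    semaphore = asyncio.Semaphore(max_concurrency)

    async def read_one(path: str) -> List[str]:
        async with semaphore:  # Cap the number of files read at once
            return await reader.async_read(path)

    per_file = await asyncio.gather(*(read_one(p) for p in paths))
    # gather returns one list per file, in input order; flatten them.
    return [doc for docs in per_file for doc in docs]


all_documents = asyncio.run(read_all(["a.pdf", "b.pdf", "c.pdf"]))
```

Because `asyncio.gather` preserves input order, the flattened result lines up with the original file list even though the reads overlap in time.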

## Usage in Knowledge

Readers integrate seamlessly with Agno Knowledge:

```python
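from agno.knowledge.chunking.semantic import SemanticChunking
from agno.knowledge.knowledge import Knowledge
from agno.knowledge.reader.pdf_reader import PDFReader

# Custom reader configuration
reader = PDFReader(
    chunk_size=1000,
    chunking_strategy=SemanticChunking(),
)

# `vector_db` is assumed to be a vector database instance configured elsewhere
knowledge_base = Knowledge(
    vector_db=vector_db,
)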

# Use custom reader
knowledge_base.add_content(
    path="data/documents",
    reader=reader  # Override default reader
)
```

## Best Practices

### Choose the Right Reader
- Use specialized readers for better extraction quality
- Consider format-specific features (PDF encryption, CSV delimiters, etc.)

### Configure Chunking Appropriately
- Use smaller chunks for precise retrieval
- Use larger chunks to preserve context
- Use document chunking for structured documents, and semantic chunking when splits should follow meaning

### Optimize for Performance
- Use async readers for I/O-heavy operations
- Batch process multiple files when possible
- Cache readers through ReaderFactory when processing many files

### Handle Errors Gracefully
- Readers return an empty list when processing fails, so check for empty results before indexing
- Check reader logs for debugging information
- Provide fallback readers for unknown formats
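
One way to combine these points is to try readers in order and accept the first non-empty result. The sketch below assumes each reader is exposed as a plain callable that returns a list; `read_with_fallback`, `strict_reader`, and `plain_text_reader` are hypothetical names, not Agno APIs.

```python
from typing import Callable, List, Sequence

ReadFn = Callable[[str], List[str]]


def read_with_fallback(source: str, readers: Sequence[ReadFn]) -> List[str]:
    """Return the first non-empty result; an empty list means every reader failed."""
    for read in readers:
        try:
            documents = read(source)
        except Exception:
            # Treat an exception like an empty result and try the next reader.
            continue
        if documents:
            return documents
    return []


def strict_reader(source: str) -> List[str]:
    # Hypothetical reader that only understands .pdf and returns [] otherwise.
    return [f"pdf:{source}"] if source.endswith(".pdf") else []


def plain_text_reader(source: str) -> List[str]:
    # Hypothetical catch-all reader.
    return [f"text:{source}"]


result = read_with_fallback("notes.unknown", [strict_reader, plain_text_reader])
```

In production code it would be worth logging the swallowed exception rather than discarding it, so failures remain visible in the reader logs.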

## Next Steps

<CardGroup cols={2}>
  <Card title="Chunking Strategies" icon="scissors" href="/concepts/knowledge/chunking/overview">
    Learn how to optimize content chunking for better search results
  </Card>
  <Card title="Content Types" icon="file-lines" href="/concepts/knowledge/content_types">
    Understand different ways to add information to your knowledge base
  </Card>
  <Card title="Vector Databases" icon="database" href="/concepts/vectordb/overview">
    Choose the right storage solution for your processed content
  </Card>
  <Card title="Examples" icon="code" href="/examples/introduction">
    See readers in action with practical examples
  </Card>
</CardGroup>